This document presents an analysis of start codon usage and context in Schizosaccharomyces pombe. It uses the S. pombe 972h- translation estimates from Duncan and Mata 2017.
We remove the Sp mitochondrial genes here because their translation is not detected by this ribosome profiling protocol.
## # A tibble: 5,164 x 7
## Gene RNA RPF RNA_noN RPF_noN TE TE_noN
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SPCC1223.02 5846. 20301. 3169. 9560. 3.47 3.02
## 2 SPBC32F12.11 4811. 19320. 2608. 6422. 4.02 2.46
## 3 SPAC26F1.06 4038. 15032. 1452. 3782. 3.72 2.60
## 4 SPBC26H8.01 2817. 13956. 1570. 7698. 4.95 4.90
## 5 SPBC1815.01 3548. 12258. 1109. 1691. 3.45 1.53
## 6 SPAC27E2.11c 8220. 11694. 9323. 20023. 1.42 2.15
## 7 SPCC13B11.01 2772. 10487. 1367. 3940. 3.78 2.88
## 8 SPBC14F5.04c 2795. 9607. 889. 1815. 3.44 2.04
## 9 SPBC19C2.07 4923. 9148. 3135. 4251. 1.86 1.36
## 10 SPAC1F8.07c 3399. 8582. 1333. 2159. 2.53 1.62
## # ... with 5,154 more rows
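The TE columns in the table above appear to be the ratio of ribosome footprint abundance to RNA abundance (for example 20301 / 5846 ≈ 3.47). The following is a minimal Python sketch that checks this on the first three rows as printed; it is an illustration only, not the R code behind this report, and the helper name `translation_efficiency` is hypothetical. The printed values are rounded, so agreement is only to the displayed precision.

```python
# Rows copied from the printed table above: (Gene, RNA, RPF, RNA_noN, RPF_noN).
rows = [
    ("SPCC1223.02", 5846.0, 20301.0, 3169.0, 9560.0),
    ("SPBC32F12.11", 4811.0, 19320.0, 2608.0, 6422.0),
    ("SPAC26F1.06", 4038.0, 15032.0, 1452.0, 3782.0),
]


def translation_efficiency(rpf, rna):
    """Ribosome footprint abundance per unit of mRNA abundance."""
    return rpf / rna


for gene, rna, rpf, rna_non, rpf_non in rows:
    te = translation_efficiency(rpf, rna)
    te_non = translation_efficiency(rpf_non, rna_non)
    print(f"{gene}\tTE={te:.2f}\tTE_noN={te_non:.2f}")
```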
## # A tibble: 5,118 x 19
## Gene aATG.context aATG.pos d1.context d1.posTSS d1.posATG d1.frame
## <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 SPAC… ATTTCTACTGC… 1 AGATATCGC… 60 59 2
## 2 SPAC… ACGATTATAAG… 159 TTTAGTCCG… 266 107 2
## 3 SPAC… CAGTTTTTAGA… 61 CAGCAACTG… 85 24 0
## 4 SPAC… AAAAAAAAAAA… 339 GTAATAAGA… 376 37 1
## 5 SPAC… CTTAGCTATAA… 326 ACCGTCGAA… 371 45 0
## 6 SPAC… TTTCAATCCAA… 140 CCATTCCTC… 336 196 1
## 7 SPAC… AACTAATTCAA… 199 CTGCAGAGT… 236 37 1
## 8 SPAC… AAGTAGGAAAG… 72 GCAAAAAAC… 176 104 2
## 9 SPAC… TTTCCATCCAA… 1 AACAGCCCG… 29 28 1
## 10 SPAC… TCTTGTTAAAT… 315 TTAGAAATA… 364 49 1
## # ... with 5,108 more rows, and 12 more variables: d2.context <chr>,
## # d2.posTSS <dbl>, d2.posATG <dbl>, d2.frame <dbl>, u1.context <chr>,
## # u1.posTSS <dbl>, u1.posATG <dbl>, u1.frame <dbl>, u2.context <chr>,
## # u2.posTSS <dbl>, u2.posATG <dbl>, u2.frame <dbl>
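In the rows printed above, d1.posATG equals d1.posTSS minus aATG.pos, and d1.frame equals d1.posATG modulo 3, so frame 0 means the downstream ATG is in frame with the annotated ATG. The Python sketch below checks that relationship on the first five rows; it is inferred from the visible output rather than taken from the original R code, and `offset_and_frame` is a hypothetical helper name.

```python
# (aATG.pos, d1.posTSS, d1.posATG, d1.frame) from the first rows printed above.
rows = [
    (1, 60, 59, 2),
    (159, 266, 107, 2),
    (61, 85, 24, 0),
    (339, 376, 37, 1),
    (326, 371, 45, 0),
]


def offset_and_frame(annotated_pos, other_pos):
    """Distance from the annotated ATG to another ATG, and its reading frame."""
    offset = other_pos - annotated_pos
    return offset, offset % 3


for a_pos, d1_tss, d1_atg, d1_frame in rows:
    # Each printed row should be reproduced by the relationship above.
    assert offset_and_frame(a_pos, d1_tss) == (d1_atg, d1_frame)
```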
## [1] "old_SPAC1556.06.2%2CSPAC1556.06" "old_SPAC2F3.13c%2CSPAC2F3"
## [1] "SPAC14C4.09" "SPAC1556.06.1" "SPAC1556.06.2" "SPAC212.05c"
## [5] "SPAC212.07c" "SPAC212.09c" "SPAC212.10" "SPAC23A1.20"
## [9] "SPAC23D3.05c" "SPAC2E12.05" "SPAC2F3.12c" "SPAC2F3.13c"
## [13] "SPAC750.08c" "SPAC977.01" "SPAC977.13c" "SPAPB24D3.05c"
## [17] "SPBC1348.11" "SPBC1706.02c" "SPBC18E5.15" "SPBC1E8.04"
## [21] "SPBC31A8.02" "SPBC460.01c" "SPBC460.02c" "SPBC460.03"
## [25] "SPBC460.04c" "SPBC460.05" "SPBCPT2R1.05c" "SPBCPT2R1.06c"
## [29] "SPBCPT2R1.07c" "SPBCPT2R1.10" "SPBPB10D8.03" "SPBPB21E7.06"
## [33] "SPBPB21E7.08" "SPCC132.05c" "SPCC1450.01c" "SPCC1494.11c"
## [37] "SPCC188.10c" "SPCC18B5.02c" "SPCC548.02c" "SPCC576.16c"
## [41] "SPCC622.17" "SPCC663.07c" "SPCC830.02" "SPCP20C8.03"
## [45] "SPMTR.01" "SPMTR.02" "SPMTR.03" "SPMTR.04"
That’s for hiTrans, the top 5% (256) of genes ranked by translation (RPF TPM).
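As a rough illustration of how such a set can be picked, here is a Python sketch that ranks genes by a value and keeps the top fraction. With 5,118 genes, 5% is 255.9, which matches the 256 quoted above if the count is rounded up; whether the original analysis rounded up or to the nearest integer is an assumption. The function name and toy data are hypothetical.

```python
import math


def top_fraction(values_by_gene, fraction=0.05):
    """Return the genes with the highest values, keeping ceil(fraction * n)."""
    n_keep = math.ceil(fraction * len(values_by_gene))
    ranked = sorted(values_by_gene, key=values_by_gene.get, reverse=True)
    return ranked[:n_keep]


# Toy data: hypothetical gene names and RPF values, not from the analysis.
rpf = {f"gene{i}": float(i) for i in range(1, 31)}
hi_trans = top_fraction(rpf)
print(hi_trans)
```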
Venn diagram.
There are many duplicated cytoplasmic ribosomal protein (cytoRibo) genes in Sp.
First downstream ATG
Except for 3rd-codon-position bias.
We calculate a motif score against the position weight matrix (PWM) for both the narrow (-4 from the ATG through to the ATG) and the wide (-10 from the ATG) Kozak consensus motifs. These motifs are taken from the top 5% most highly translated genes.
Information content is calculated as for a sequence logo; for details see https://en.wikipedia.org/wiki/Sequence_logo
## # A tibble: 6 x 4
## Genes ATG Width Infon
## <chr> <chr> <chr> <dbl>
## 1 All aATG narrow 0.826
## 2 HiTrans aATG narrow 2.47
## 3 CytoRibo aATG narrow 2.92
## 4 All d1ATG narrow 0.182
## 5 HiTrans d1ATG narrow 0.0885
## 6 CytoRibo d1ATG narrow 0.106
Information content in bits of the highly translated consensus (excluding the 6 bits from the ATG itself): narrow is 0.72, wide is 8.21.
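For reference, per-position information content in a sequence logo is 2 bits minus the Shannon entropy of the base frequencies at that position, so each invariant position of the ATG contributes 2 bits (6 bits in total). The Python sketch below computes this for a toy alignment. It applies no small-sample correction, which is an assumption of the sketch, and the sequences are hypothetical rather than taken from the S. pombe data.

```python
import math

BASES = "ACGT"


def information_content(sequences):
    """Per-position information in bits: 2 - Shannon entropy of base frequencies.

    No small-sample correction is applied (an assumption of this sketch).
    """
    length = len(sequences[0])
    bits = []
    for pos in range(length):
        column = [seq[pos] for seq in sequences]
        entropy = 0.0
        for base in BASES:
            freq = column.count(base) / len(column)
            if freq > 0:
                entropy -= freq * math.log2(freq)
        bits.append(2.0 - entropy)
    return bits


# Toy aligned contexts (hypothetical): 4 nt upstream followed by ATG.
contexts = ["AAAAATG", "CAAAATG", "GAACATG", "TAATATG"]
bits = information_content(contexts)
upstream_bits = sum(bits[:4])  # excludes the 3 invariant ATG positions
atg_bits = sum(bits[4:])       # 3 invariant positions x 2 bits each
print(upstream_bits, atg_bits)
```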
We calculate scores using Biostrings::PWMscoreStartingAt.
The best description I could find of this method is: https://support.bioconductor.org/p/61520/
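The score is just the sum, over motif positions, of the PWM entry for the base observed at that position (equivalently, the sum of the elementwise product of the PWM with the one-hot encoded sequence).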
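A minimal Python sketch of that scoring rule follows. It is not the Biostrings implementation; the function name `pwm_score`, the list-of-dicts layout, and the toy weights are all hypothetical. The toy weights are chosen so that the best possible match sums to 1, which is consistent with the 0 to 1 range of the scores reported below, but the scaling used in the actual analysis is an assumption here. Note the 0-based `start`, unlike the 1-based positions used in R.

```python
def pwm_score(pwm, sequence, start=0):
    """Sum the PWM entry for the observed base at each motif position.

    pwm: list of dicts, one per motif position, mapping base -> weight.
    start: 0-based index in `sequence` where the motif window begins.
    """
    window = sequence[start:start + len(pwm)]
    if len(window) < len(pwm):
        raise ValueError("sequence too short for the motif at this start")
    return sum(column[base] for column, base in zip(pwm, window))


# Toy 3-position PWM (hypothetical weights; the best match sums to 1).
toy_pwm = [
    {"A": 0.5, "C": 0.1, "G": 0.2, "T": 0.0},
    {"A": 0.3, "C": 0.0, "G": 0.1, "T": 0.1},
    {"A": 0.2, "C": 0.1, "G": 0.0, "T": 0.1},
]
best = pwm_score(toy_pwm, "AAA")              # consensus window
other = pwm_score(toy_pwm, "GGCTA", start=1)  # scores the window "GCT"
print(best, other)
```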
Write scores to file scores_kozak_Sp.txt.
## # A tibble: 5,118 x 11
## Gene aATG.scorekn d1.scorekn u1.scorekn aATG.scorekw d1.scorekw
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SPAC… 0.816 0.756 NA 0.810 0.742
## 2 SPAC… 0.867 0.799 0.957 0.849 0.767
## 3 SPAC… 0.920 0.748 NA 0.891 0.722
## 4 SPAC… 0.938 0.887 NA 0.940 0.877
## 5 SPAC… 0.952 0.920 0.748 0.907 0.855
## 6 SPAC… 0.968 0.787 NA 0.944 0.792
## 7 SPAC… 0.948 0.848 NA 0.939 0.767
## 8 SPAC… 0.907 0.890 0.840 0.832 0.900
## 9 SPAC… 0.948 0.851 NA 0.911 0.798
## 10 SPAC… 0.898 0.887 NA 0.868 0.876
## # ... with 5,108 more rows, and 5 more variables: u1.scorekw <dbl>,
## # d1vsan <dbl>, u1vsan <dbl>, d1vsaw <dbl>, u1vsaw <dbl>
There are definitely correlations there.
We suspect that uATG is associated with lower TE if the uATG has
This figure shows that, for genes with only 1 uATG, this correlation is weak and opposite to expected.
Red: high dATG vs aATG Kozak score. Blue: highly translated. Purple: both.
R = -0.04
Red: high dATG vs aATG Kozak score. Blue: highly translated. Purple: both.
R = -0.043
Those genes are in this list:
## # A tibble: 400 x 3
## Gene aATG.scorekn d1.scorekn
## <chr> <dbl> <dbl>
## 1 SPCC1620.12c 0.711 1.000
## 2 SPBC9B6.11c 0.688 0.960
## 3 SPBC13G1.01c 0.722 0.991
## 4 SPAC222.04c 0.732 1.000
## 5 SPAC26F1.05 0.724 0.991
## 6 SPBC1289.10c 0.695 0.959
## 7 SPAC1F5.07c 0.738 1.000
## 8 SPAC1B2.06 0.740 1.000
## 9 SPAC12B10.01c 0.732 0.991
## 10 SPAC7D4.13c 0.734 0.991
## # ... with 390 more rows
Some of these are decently translated (top 25%).
## # A tibble: 32 x 3
## Gene aATG.scorekn d1.scorekn
## <chr> <dbl> <dbl>
## 1 SPCC1672.02c 0.738 0.960
## 2 SPBC2G2.12 0.692 0.905
## 3 SPBC1198.08 0.724 0.935
## 4 SPAC3H1.05 0.750 0.952
## 5 SPBC25H2.16c 0.812 1.000
## 6 SPAC1565.08 0.747 0.934
## 7 SPBC1677.03c 0.724 0.908
## 8 SPBC146.13c 0.711 0.887
## 9 SPBC11C11.05 0.764 0.934
## 10 SPBC582.03 0.787 0.957
## # ... with 22 more rows
Several of these genes are involved in mRNA 3’ end regulation. Otherwise the pattern is unclear. We should check if those ATGs are actually used.
Genes with a high difference in narrow score (d1 vs aATG), filtered for the top 50% by RNA, with the downstream ATG in frame. Saved to dvsaATG_highdiffn_inframe_Sp.txt.
## # A tibble: 30 x 3
## Gene aATG.scorekn d1.scorekn
## <chr> <dbl> <dbl>
## 1 SPBC9B6.11c 0.688 0.960
## 2 SPAC12B10.01c 0.732 0.991
## 3 SPAC3G6.04 0.754 0.991
## 4 SPBC428.01c 0.701 0.925
## 5 SPCC1672.02c 0.738 0.960
## 6 SPBC2G2.12 0.692 0.905
## 7 SPBC1198.08 0.724 0.935
## 8 SPBP35G2.14 0.692 0.896
## 9 SPAC3F10.05c 0.732 0.935
## 10 SPAC3H1.05 0.750 0.952
## # ... with 20 more rows
Genes with a high difference in narrow score (d1 vs aATG), filtered for the top 50% by RNA, with the downstream ATG out of frame. Saved to dvsaATG_highdiffn_outframe_Sp.txt.
## # A tibble: 127 x 3
## Gene aATG.scorekn d1.scorekn
## <chr> <dbl> <dbl>
## 1 SPBC13G1.01c 0.722 0.991
## 2 SPAC222.04c 0.732 1.000
## 3 SPBC1289.10c 0.695 0.959
## 4 SPAC11E3.01c 0.738 0.991
## 5 SPCC31H12.08c 0.721 0.968
## 6 SPAC1071.01c 0.746 0.991
## 7 SPBC28F2.10c 0.758 1.000
## 8 SPCC663.10 0.750 0.991
## 9 SPCC1739.14 0.714 0.948
## 10 SPBC543.09 0.767 1.000
## # ... with 117 more rows
Genes with a high difference in wide score (d1 vs aATG), filtered for the top 50% by RNA, with the downstream ATG in frame. Saved to dvsaATG_highdiffw_inframe_Sp.txt.
## # A tibble: 30 x 3
## Gene aATG.scorekw d1.scorekw
## <chr> <dbl> <dbl>
## 1 SPAC12B10.01c 0.729 0.978
## 2 SPBC9B6.11c 0.681 0.928
## 3 SPAC1565.08 0.690 0.907
## 4 SPAC3F10.05c 0.715 0.929
## 5 SPBC1198.08 0.699 0.910
## 6 SPAC3G6.04 0.750 0.958
## 7 SPBC428.01c 0.706 0.912
## 8 SPBC11C11.05 0.714 0.902
## 9 SPBC146.12 0.654 0.841
## 10 SPAC57A7.12 0.740 0.901
## # ... with 20 more rows
Genes with a high difference in wide score (d1 vs aATG), filtered for the top 50% by RNA, with the downstream ATG out of frame. Saved to dvsaATG_highdiffw_outframe_Sp.txt.
## # A tibble: 128 x 3
## Gene aATG.scorekw d1.scorekw
## <chr> <dbl> <dbl>
## 1 SPAC222.04c 0.731 0.998
## 2 SPAC11E3.01c 0.708 0.959
## 3 SPBC1703.02 0.678 0.925
## 4 SPBC13G1.01c 0.759 0.967
## 5 SPCC31H12.08c 0.705 0.910
## 6 SPAC29E6.03c 0.717 0.913
## 7 SPAC1071.01c 0.723 0.918
## 8 SPAC10F6.17c 0.733 0.924
## 9 SPCC1739.14 0.728 0.919
## 10 SPBC25H2.16c 0.807 0.990
## # ... with 118 more rows
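The four lists above share one recipe: keep genes in the top half by RNA, keep those whose downstream-ATG score exceeds the annotated-ATG score by some margin, sort by that difference, and split by frame. A rough Python sketch of that recipe is below. The cutoff `min_diff` is hypothetical (the threshold actually used is not stated here), the use of the median as the RNA cutoff is an assumption, and the toy genes are invented. Passing the `scorekw` column names instead would give the wide-motif version.

```python
from statistics import median


def split_high_diff(genes, min_diff,
                    score_a="aATG.scorekn", score_d="d1.scorekn"):
    """Return (in_frame, out_of_frame) gene names with a high score difference.

    genes: list of dicts with keys Gene, RNA, d1.frame and the score columns.
    min_diff: hypothetical cutoff on (downstream score - annotated score).
    """
    rna_cutoff = median(g["RNA"] for g in genes)  # keep the top 50% by RNA
    kept = [
        g for g in genes
        if g["RNA"] >= rna_cutoff and g[score_d] - g[score_a] >= min_diff
    ]
    # Largest difference first, as in the printed tables.
    kept.sort(key=lambda g: g[score_d] - g[score_a], reverse=True)
    in_frame = [g["Gene"] for g in kept if g["d1.frame"] == 0]
    out_frame = [g["Gene"] for g in kept if g["d1.frame"] != 0]
    return in_frame, out_frame


# Toy genes (hypothetical values, not from the analysis).
genes = [
    {"Gene": "g1", "RNA": 40.0, "d1.frame": 0,
     "aATG.scorekn": 0.70, "d1.scorekn": 0.95},
    {"Gene": "g2", "RNA": 30.0, "d1.frame": 2,
     "aATG.scorekn": 0.72, "d1.scorekn": 0.99},
    {"Gene": "g3", "RNA": 20.0, "d1.frame": 0,
     "aATG.scorekn": 0.70, "d1.scorekn": 0.99},
    {"Gene": "g4", "RNA": 10.0, "d1.frame": 1,
     "aATG.scorekn": 0.90, "d1.scorekn": 0.91},
]
in_frame, out_frame = split_high_diff(genes, min_diff=0.1)
print(in_frame, out_frame)
```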
Mitochondrial presequence predictions (the Pred_preseq column below) are in the input file SPombe_mitofates.txt.
## # A tibble: 16 x 5
## # Groups: enoughR, d1vsaw0p1, d1.framefac [?]
## enoughR d1vsaw0p1 d1.framefac Pred_preseq n
## <fct> <fct> <fct> <fct> <int>
## 1 Yes d1lo In No 451
## 2 Yes d1lo In Yes 51
## 3 Yes d1lo Out No 1765
## 4 Yes d1lo Out Yes 132
## 5 Yes d1hi In No 26
## 6 Yes d1hi In Yes 4
## 7 Yes d1hi Out No 103
## 8 Yes d1hi Out Yes 12
## 9 No d1lo In No 403
## 10 No d1lo In Yes 34
## 11 No d1lo Out No 1770
## 12 No d1lo Out Yes 108
## 13 No d1hi In No 31
## 14 No d1hi In Yes 4
## 15 No d1hi Out No 171
## 16 No d1hi Out Yes 11
Although the +5 T is striking.
Those genes are in this list:
## # A tibble: 31 x 9
## Gene aATG.scorekn d1.scorekn d1.frame d1.posATG RNA RPF RNA_noN
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SPBC… 0.722 0.991 2 128 144. 48.5 92.0
## 2 SPCC… 0.748 1.000 1 31 69.7 2.57 53.9
## 3 SPAP… 0.701 0.951 2 80 85.9 12.3 37.4
## 4 SPBC… 0.767 1.000 1 274 196. 51.7 215.
## 5 SPBC… 0.778 1.000 0 216 50.3 19.1 44.5
## 6 SPBC… 0.769 0.991 1 10 232. 30.6 121.
## 7 SPCC… 0.769 0.991 2 56 66.0 2.76 78.3
## 8 SPBC… 0.701 0.917 2 62 134. 43.6 118.
## 9 SPBC… 0.692 0.905 0 75 264. 173. 118.
## 10 SPBC… 0.757 0.966 0 24 92.7 26.9 90.6
## 11 SPAC… 0.748 0.952 2 122 311. 44.3 355.
## 12 SPAC… 0.715 0.912 2 71 53.9 3.03 62.8
## 13 SPBC… 0.758 0.951 2 86 64.7 3.47 21.6
## 14 SPBC… 0.724 0.908 2 95 394. 149. 275.
## 15 SPBC… 0.711 0.887 1 73 280. 88.4 466.
## 16 SPAC… 0.831 1.000 1 49 90.6 23.1 112.
## 17 SPAC… 0.732 0.900 0 3 331. 272. 694.
## 18 SPBC… 0.722 0.890 0 39 117. 41.2 111.
## 19 SPBC… 0.743 0.908 2 68 184. 73.5 225.
## 20 SPBC… 0.722 0.881 2 5 80.4 4.74 57.3
## 21 SPAC… 0.790 0.948 2 104 112. 33.8 98.5
## 22 SPAP… 0.764 0.920 2 77 38.2 12.3 93.8
## 23 SPBC… 0.755 0.908 2 68 84.9 18.4 88.3
## 24 SPAC… 0.707 0.859 1 73 90.1 9.58 92.5
## 25 SPCC… 0.780 0.930 2 41 69.8 2.94 24.9
## 26 SPAC… 0.688 0.837 0 42 58.5 11.4 43.0
## # ... with 5 more rows, and 1 more variable: RPF_noN <dbl>
Histidine-tRNA ligase (HisRS), hurrah!
Several mitochondrial ribosomal proteins, check these.
Mostly, though, the 2nd ATG is not in frame. Restricting to in-frame ATGs gives only Rex2 and HisRS.
## # A tibble: 94 x 8
## Gene aATG.scorekn d1.scorekn d1.posATG RNA RPF RNA_noN RPF_noN
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SPBC13… 0.778 1.000 216 50.3 1.91e+1 44.5 1.52e+1
## 2 SPBC2G… 0.692 0.905 75 264. 1.73e+2 118. 6.19e+1
## 3 SPBC16… 0.757 0.966 24 92.7 2.69e+1 90.6 2.46e+1
## 4 SPAC6C… 0.732 0.900 3 331. 2.72e+2 694. 7.98e+2
## 5 SPBC14… 0.722 0.890 39 117. 4.12e+1 111. 4.75e+1
## 6 SPAC3H… 0.688 0.837 42 58.5 1.14e+1 43.0 1.36e+1
## 7 SPBC2G… 0.738 0.870 9 172. 3.27e+1 69.1 7.11e+0
## 8 SPBC60… 0.758 0.868 84 23.6 1.69e+0 22.4 3.39e+0
## 9 SPAC24… 0.786 0.891 3 73.5 9.73e+0 90.9 1.57e+1
## 10 SPAC23… 0.763 0.864 36 176. 1.01e+2 231. 1.76e+2
## 11 SPAC57… 0.819 0.917 18 14.2 3.19e-1 13.7 4.68e-1
## 12 SPBC21… 0.760 0.852 3 208. 1.07e+2 221. 1.95e+2
## 13 SPBC2G… 0.864 0.952 105 794. 1.62e+3 1170. 2.54e+3
## 14 SPBC12… 0.795 0.878 60 97.4 9.91e+1 237. 3.49e+2
## 15 SPBC21… 0.925 1.000 9 259. 8.78e+1 335. 1.90e+2
## 16 SPAC4F… 0.917 0.991 60 169. 3.94e+1 162. 3.60e+1
## 17 SPCC16… 0.731 0.790 54 66.4 3.86e+2 72.9 7.99e+2
## 18 SPBC88… 0.779 0.838 60 136. 1.44e+1 148. 1.33e+1
## 19 SPBC16… 0.935 0.991 42 375. 7.17e+1 322. 4.96e+1
## 20 SPBP8B… 0.824 0.878 21 90.8 2.63e+1 51.1 1.78e+1
## # ... with 74 more rows
Some are dual-localized.
We predict that many more of the dual-localized proteins, such as aminoacyl-tRNA synthetases, have non-ATG starts.